Two-Dimensional Data Worksheet

This worksheet focuses on manipulating two-dimensional data using Python and pandas.


In [1]:
%pylab inline
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None


Populating the interactive namespace from numpy and matplotlib

In [4]:
#Create a DataFrame called twitterData from the CSV file
#Note: if the full file is too large for your machine, a smaller data set called twitter1-small.csv is in the data folder
twitterData = pd.read_csv( '../../data/twitter1.csv', encoding='iso8859_15' )

Exercise 1

Using the twitterData DataFrame and the commands we have learned thus far, create a Series called tweetCounts which contains each user name and how many tweets that user posted. Next, output the top 10 "tweeters".


In [5]:
tweetCounts = twitterData['Username'].value_counts()
tweetCounts.head(10)


Out[5]:
HoolohaTube        155
Rasu24             150
HOOLOHASPORT       126
mahboobali3        119
EminemsRealWife    116
byezekiel           89
MyrtleMuelr         83
LucindaFischer      79
DebraRichayd        77
JeanieNoble         70
Name: Username, dtype: int64
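
To see how value_counts behaves without loading the full file, here is a minimal sketch on a toy DataFrame. The user names are invented for illustration and are not taken from the Twitter data. It also shows that groupby().size() should produce the same counts, although it needs an explicit sort.

```python
import pandas as pd

# Hypothetical user names, invented for illustration only
toy = pd.DataFrame({'Username': ['ann', 'bob', 'ann', 'cat', 'ann', 'bob']})

# value_counts() counts occurrences and sorts them in descending order
toy_counts = toy['Username'].value_counts()

# groupby().size() gives the same counts but is sorted by name, so sort by count
toy_counts_grouped = toy.groupby('Username').size().sort_values(ascending=False)

# head(n) keeps the n most frequent names
top_two = toy_counts.head(2)
print(top_two)
```

Because value_counts already sorts in descending order, head(10) is enough to get the top 10 in the exercise.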

Exercise 2

Using the original twitter data set, create a second DataFrame called twitterSummary which contains the following columns:

  • Username
  • Friends
  • Followers

Next add a column called ffratio which contains the ratio of friends to followers.


In [6]:
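# .copy() makes twitterSummary an independent DataFrame, so adding a column does not touch twitterData
twitterSummary = twitterData[['Username', 'Friends', 'Followers']].copy()
twitterSummary['ffratio'] = twitterSummary['Friends'] / twitterSummary['Followers']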

twitterSummary.head()


Out[6]:
         Username  Friends  Followers   ffratio
0    _prettybrown     1042       1538  0.677503
1  CarlyManning24      278        304  0.914474
2  madzLuvzLakers      619       1039  0.595765
3        _AyyJayy      203        204  0.995098
4     Akeemoneale      165         27  6.111111
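
One thing to watch for with a ratio column is a user with zero followers. The sketch below uses made-up rows (not from the data set) to show what pandas does in that case and one possible way to handle it. Whether replacing the result with NaN is the right choice depends on your analysis, so treat this as an assumption rather than the required answer.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, invented for illustration; user 'c' has zero followers
toy = pd.DataFrame({
    'Username': ['a', 'b', 'c'],
    'Friends': [10, 5, 7],
    'Followers': [20, 5, 0],
})

# Take an independent copy of the columns of interest
summary = toy[['Username', 'Friends', 'Followers']].copy()

# Element-wise division; a nonzero value divided by zero gives inf, not an error
summary['ffratio'] = summary['Friends'] / summary['Followers']

# Replace infinite ratios with NaN so later statistics can skip them
summary['ffratio'] = summary['ffratio'].replace([np.inf, -np.inf], np.nan)
print(summary)
```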

Exercise 3

In the Data folder, there is a spreadsheet called studentData.csv consisting of students and test scores. Write a script which calculates each student's average test score and adds that as a column to the DataFrame. The first person to raise their hand and tell me which student has the highest average test score, and what that score is, wins something.


In [7]:
studentData = pd.read_csv('../../data/studentData.csv')

studentData['average'] = studentData[['Test1', 'Test2', 'Test3', 'Test4', 'Test5']].mean(axis=1)

studentData.sort_values('average', axis=0, ascending=False )


Out[7]:
   Student ID  Test1  Test2  Test3  Test4  Test5  average
8           9   0.90   0.85   0.76   0.83   0.99    0.866
6           7   0.88   0.59   0.94   0.92   0.83    0.832
0           1   0.78   0.96   0.82   0.63   0.88    0.814
7           8   0.93   0.70   0.92   0.91   0.56    0.804
2           3   0.66   0.80   0.97   0.77   0.80    0.800
1           2   0.95   0.80   0.70   0.82   0.72    0.798
9          10   0.67   0.84   0.54   0.73   0.89    0.734
3           4   0.73   0.67   0.85   0.68   0.59    0.704
4           5   0.54   0.76   0.65   0.54   0.92    0.682
5           6   0.70   0.70   0.60   0.54   0.63    0.634
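
Sorting works, but if you only need the top student you can ask for that row directly. The sketch below is one possible alternative, not the required solution. It rebuilds three of the rows shown above by hand so it runs without the CSV file; the names students, test_cols, and best are made up for this example.

```python
import pandas as pd

# Three rows copied from the output above (Student IDs 1, 7, and 9)
students = pd.DataFrame({
    'Student ID': [1, 7, 9],
    'Test1': [0.78, 0.88, 0.90],
    'Test2': [0.96, 0.59, 0.85],
    'Test3': [0.82, 0.94, 0.76],
    'Test4': [0.63, 0.92, 0.83],
    'Test5': [0.88, 0.83, 0.99],
})

# Select the test columns by name prefix instead of listing each one
test_cols = [c for c in students.columns if c.startswith('Test')]

# axis=1 averages across the columns of each row
students['average'] = students[test_cols].mean(axis=1)

# idxmax() returns the index label of the largest average; .loc fetches that row
best = students.loc[students['average'].idxmax()]
print(best)
```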

Exercise 4

Using the Twitter data, find all the users with Facebook accounts and create a new column called FacebookID which contains each user's Facebook ID. The ID appears in the URL column, as in this example URL: http://www.facebook.com/profile.php?id=5141860. Extract it using the str.extract function. Don't forget to remove all the invalid or empty IDs.

We've already created a DataFrame for you in the cell above.


In [8]:
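# Keep only rows whose URL mentions Facebook; .copy() avoids modifying a view of twitterData
newData = twitterData[ twitterData['URL'].fillna("").str.contains('facebook') ].copy()
# Raw string so \? and \d reach the regex engine unchanged; the dot is escaped to match literally
newData['FacebookID'] = newData['URL'].str.extract( r'profile\.php\?id=(\d+)', expand=False)
# Note: with no arguments, dropna removes rows that have a missing value in any column
newData.dropna( inplace=True )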

In [10]:
newData.head()


Out[10]:
Primary Key Service Term Username Name Update Location URL Friends Followers Time(PDT) City State/Region Country Metro Latitude Longitude FacebookID
14 15 twitter lakers MrBAAD Tashaun Williams @goodyCHOOshoes haha im sorry for you then... ... Miami http://www.facebook.com/profile.php?id=5141860... 187 143 6/3/2010 17:00 Miami FL US Miami-Fort Lauderdale-Pompano Beach FL 25.604410 -80.335216 514186015
66 67 twitter lakers,celtics HoneyHoward Jasmine Howard bout to cook dinner&&split this wig before the... Washington, D.C. http://www.facebook.com/home.php#/profile.php?... 349 1155 6/3/2010 17:00 Washington DC US Washington-Arlington-Alexandria DC-VA-MD-WV 38.950224 -77.019714 1707568551
84 85 twitter lakers Est_June3rd Chuck K hmmm im seein alot of new lakers fans on ma ti... Pontiac,MI http://www.facebook.com/profile.php?id=1060743... 1397 1606 6/3/2010 17:00 Pontiac MI US Detroit-Warren-Livonia MI 42.668599 -83.290343 1060743307
155 156 twitter celtics,lakers NGz_Swift Yung Crush Celtics finish smash the Lakers so I guess som... Los Angeles,CA http://www.facebook.com/#!/profile.php?id=1701... 238 574 6/3/2010 17:00 Los Angeles CA US Los Angeles-Long Beach-Santa Ana CA 34.009842 -118.258642 1701903898
638 639 twitter lakers GoodLookTy_BFA Tyquan Moore @Relly718 19 to 1 Lakers lol Brooklyn, New York http://www.facebook.com/profile.php?id=551167346 259 339 6/3/2010 17:02 Brooklyn NY US New York-Northern New Jersey-Long Island NY-NJ-PA 40.645412 -73.958730 551167346
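
To make the extraction step easier to follow, here is a minimal sketch on a handful of URLs. Only the first URL comes from the output above; the others are invented to cover the cases where there is no ID, no Facebook address, or no URL at all. The variable names are hypothetical. Unlike a bare dropna, this version uses subset so that only rows with a missing FacebookID are removed, which is an assumption about what "invalid or empty" should mean.

```python
import pandas as pd

toy = pd.DataFrame({'URL': [
    'http://www.facebook.com/profile.php?id=551167346',  # from the output above
    'http://www.facebook.com/somepage',                  # hypothetical: Facebook, but no ID
    'http://example.com/',                               # hypothetical: not Facebook
    None,                                                # missing URL
]})

# fillna('') lets str.contains return False instead of NaN for missing URLs
is_facebook = toy['URL'].fillna('').str.contains('facebook')
fb = toy[is_facebook].copy()

# The single capture group (\d+) is what str.extract returns;
# expand=False gives a Series instead of a one-column DataFrame
fb['FacebookID'] = fb['URL'].str.extract(r'profile\.php\?id=(\d+)', expand=False)

# URLs that did not match the pattern get NaN; drop only those rows
fb = fb.dropna(subset=['FacebookID'])
print(fb)
```

Note that the extracted IDs are strings, not integers, so convert them if you need numeric comparisons.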
